Abstract

The folding rates of two-state proteins have been found to correlate with simple 
measures of native-state topology. The most prominent among these measures is the 
relative contact order (CO), which is the average CO or 'localness' of all contacts in 
the native protein structure, divided by the chain length. Here, we test whether such 
measures can be generalized to capture the effect of chain crosslinks on the folding 
rate. Crosslinks change the chain connectivity and therefore also the localness of
some of the native contacts. These changes in localness can be taken into account
by the graph-theoretical concept of effective contact order (ECO). However, the
relative ECO, the natural extension of the relative CO to proteins with crosslinks,
overestimates the changes in the folding rates caused by crosslinks. We suggest here a
novel measure of native-state topology, the relative logCO, and its natural extension, 
the relative logECO. The relative logCO is the average value for the logarithm of 
the CO of all contacts, divided by the logarithm of the chain length. The relative 
log(E)CO reproduces the folding rates of a set of 26 two-state proteins without 
crosslinks with essentially the same high correlation coefficient as the relative CO. 
In addition, it also captures the folding rates of 8 two-state proteins with crosslinks. 



1 Introduction 



Small, single-domain proteins often are two-state folders. 1-3 These proteins
fold from the denatured to the native state without populating experimentally
detectable intermediate states. The folding times of two-state proteins have
been found to vary over many orders of magnitude, from microseconds to
seconds. 1-3 In 1998, Plaxco et al. 4,5 made the remarkable observation that the
folding rates, i.e. the inverse folding times, correlate with a simple measure of
native-state topology, the relative contact order (CO). Subsequently, comparable
correlations have also been found for other simple measures of native-state
topology such as 'long-range order', 6 the number of native contacts, 7,8 the
'total contact distance', 9 'cliquishness', 10 and local secondary structure
content. 11

A deeper understanding of these simple topological measures requires a test
of their assumptions and implications. The most prominent topological measure,
the relative CO, is defined as the average CO or 'localness' of all native
contacts, divided by the chain length N. The localness or CO of a contact
between residues i and j is the number $|i - j|$ of covalently connected residues
between the two residues. The correlation between the relative CO and the
folding rates of two-state proteins implies that proteins with many local
contacts (e.g., α-helical proteins) fold faster than proteins with predominantly
nonlocal contacts (e.g., some β-sheet proteins). 1-3 The localness of a contact
is a measure for the length of the loop that has to be closed to form the contact
from the fully unfolded state. Since small loops in a flexible chain molecule on
average close faster than larger loops, it seems understandable that proteins
with small relative CO fold faster than proteins with larger relative CO.

In this article, we consider a simple test of the loop-closure principle
underlying the relative CO. Introducing covalent chain crosslinks such as disulfide
bonds into the protein chain decreases the localness of some of the native
contacts, since the crosslinks 'short-circuit' the chain. The crosslinks typically
lead to an increase in the folding rate, 12-15 which is in qualitative agreement
with arguments based on native-state topology. Here, we test whether topological
measures of contact localness are able to reproduce these folding rate
changes quantitatively. A natural extension of the localness of a contact in
a crosslinked chain is the effective contact order (ECO), 16,17 the minimum
number of covalently connected residues between the two residues in contact.
The ECO is a measure for the smallest loop that has to be closed to form
a contact in a crosslinked, but otherwise unfolded chain. Without crosslinks,
the ECO of a contact between two residues i and j reduces to the CO, the
sequence separation $|i - j|$.
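The shortest-path definition of the ECO can be illustrated with a minimal Python sketch. It treats the chain as a graph in which sequence neighbors and crosslinked residues are connected, and uses a breadth-first search to count the steps between two residues. The function name, the 1-based residue numbering, and the choice to count a crosslink as a single step are assumptions made for this illustration, not details taken from the text.

```python
from collections import deque


def effective_contact_order(n_residues, crosslinks, i, j):
    """Number of steps on the shortest covalent path between residues i and j.

    Residues are numbered 1..n_residues. `crosslinks` is a list of residue
    pairs (a, b). Assumption: a crosslink counts as one step, like a
    peptide bond between sequence neighbors.
    """
    # Build the connectivity graph: backbone neighbors plus crosslinks.
    neighbors = {r: set() for r in range(1, n_residues + 1)}
    for r in range(1, n_residues):
        neighbors[r].add(r + 1)
        neighbors[r + 1].add(r)
    for a, b in crosslinks:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Breadth-first search from i; the first visit of j gives the shortest path.
    steps = {i: 0}
    queue = deque([i])
    while queue:
        r = queue.popleft()
        if r == j:
            return steps[r]
        for s in neighbors[r]:
            if s not in steps:
                steps[s] = steps[r] + 1
                queue.append(s)
    raise ValueError("residues are not connected")


# Hypothetical 56-residue chain with one crosslink between residues 35 and 50.
print(effective_contact_order(56, [], 34, 51))          # no crosslink: |34 - 51|
print(effective_contact_order(56, [(35, 50)], 34, 51))  # path 34 -> 35 -> 50 -> 51
```

Without crosslinks the search simply walks along the backbone, so the result equals the sequence separation. With the crosslink, the contact between residues 34 and 51 is reached in three steps instead of 17.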

We test and compare two topological measures based on localness. The first
measure is the relative CO, and its natural extension for crosslinked chains,
the relative ECO. The second measure is a novel measure, the relative logCO,
and its natural extension, the relative logECO. The relative logCO is defined
as the average logarithm of the localness of all native contacts, divided by
the logarithm of the chain length. The logarithm of the localness, i.e. the
loop length, of a contact is an estimate for the chain entropy loss caused
by the loop closure. The relative logCO and logECO therefore may be seen
as naive measures of entropic folding barriers. The relative CO and relative
logCO exhibit essentially the same high correlations with the folding rates of
26 two-state proteins without crosslinks. In addition, the relative logECO also
captures the folding rates of 8 two-state proteins with crosslinks. The relative
ECO, in contrast, seems to overestimate the folding rates of these proteins.



2 Methods and results 

The relative ECO of a protein structure is defined as 

$$\text{rel. ECO} = \frac{1}{N M} \sum_{i=1}^{M} \mathrm{ECO}(i) \qquad (1)$$

The sum is taken over all contacts i between non-hydrogen atoms of different
residues, with total number M, and N is the chain length, the total number
of residues. The ECO of contact i is the minimum number of covalently
connected residues between the residues in contact. More precisely, the ECO
is the length of the shortest path between the two residues of the contacting
atoms, where each step on this path is a step between covalently connected
residues. Following Plaxco et al., 4,5 we define two non-hydrogen atoms to be in
contact if their distance is less than 6 Å.
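As a rough illustration of the contact definition, the following Python sketch collects all pairs of non-hydrogen atoms from different residues that lie within the distance cutoff, and then evaluates the relative CO of a chain without crosslinks. The input format (a list of residue numbers with coordinates), the function names, and the toy coordinates are assumptions for this example. Reading a real PDB file is left out.

```python
import math
from itertools import combinations


def atom_contacts(atoms, cutoff=6.0):
    """Return residue pairs (a, b) for all atom pairs in contact.

    `atoms` is a list of (residue_number, (x, y, z)) entries for
    non-hydrogen atoms. Each atom pair from different residues with a
    distance below `cutoff` (in Angstrom) counts as one contact.
    """
    contacts = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(atoms, 2):
        if res_a != res_b and math.dist(xyz_a, xyz_b) < cutoff:
            contacts.append((res_a, res_b))
    return contacts


def relative_co(contacts, n_residues):
    """Average sequence separation of all contacts, divided by chain length."""
    total = sum(abs(a - b) for a, b in contacts)
    return total / (len(contacts) * n_residues)


# Toy example: one atom per residue, placed on a straight line.
toy_atoms = [(1, (0.0, 0.0, 0.0)), (2, (3.8, 0.0, 0.0)), (3, (7.6, 0.0, 0.0))]
toy_contacts = atom_contacts(toy_atoms)
print(toy_contacts)
print(relative_co(toy_contacts, n_residues=3))
```

For a chain with crosslinks, the sequence separation in `relative_co` would be replaced by the shortest-path length between the two residues, which gives the relative ECO of Eq. (1).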

For proteins without crosslinks, the relative ECO of the protein structure is
identical with the relative CO. Grantcharova et al. 3 have considered a set of
26 proteins without crosslinks, extending a previous set of Plaxco et al. 5 by
two proteins. In Fig. 1, the relative CO of these 26 proteins is plotted against
the decadic logarithm of their folding rates (gray diamonds), together with the
relative CO (open circles) and the relative ECO (filled circles) of 8 two-state
proteins with crosslinks. For the 26 proteins without crosslinks, the Pearson
correlation coefficient between folding rate and relative CO is 0.92. The line in
Fig. 1 represents the regression line for these proteins. The position of the open
circles above this regression line indicates that the relative CO of the 8 proteins
with crosslinks underestimates the folding rates of these proteins. This is not
unexpected, since the relative CO does not capture crosslinks, which speed
up the folding process. The standard deviation of the open circles in vertical
direction from the regression line is 1.42, which is significantly larger than
the standard deviation of 0.61 for the 26 proteins without crosslinks. On the
other hand, the relative ECO overestimates the folding rates of the proteins
with crosslinks. The majority of the filled circles are located clearly below the
regression line for the proteins without crosslinks, and the standard deviation
of the 8 points from the regression line is 1.23. Despite the small number of
data points, this deviation for the relative ECO provides a relatively clear,
negative answer, since it could only be 'compensated' in a much larger data
set. For example, suppose we hypothetically add 8 'good' data points with
the same standard deviation 0.61 as the 26 proteins without crosslinks to
the 8 'poor' data points for the crosslinked proteins with standard deviation
1.23. The resulting set of 16 data points still has a standard deviation of
$\sqrt{(1.23^2 + 0.61^2)/2} = 0.97$, which is significantly larger than the deviation
0.61 for the proteins without crosslinks.
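The arithmetic behind the combined value can be written out step by step. The sketch below assumes that the quoted standard deviations are root-mean-square deviations from the same regression line, so that the squared deviations of the two equally sized groups can be pooled. The symbol names are chosen here for illustration.

```latex
\sigma_{\text{combined}}
  = \sqrt{\frac{8\,\sigma_{\text{poor}}^{2} + 8\,\sigma_{\text{good}}^{2}}{16}}
  = \sqrt{\frac{1.23^{2} + 0.61^{2}}{2}}
  = \sqrt{\frac{1.5129 + 0.3721}{2}}
  = \sqrt{0.9425}
  \approx 0.97
```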

In Fig. 2, we consider the relative logECO, a novel measure of native-state 
topology and chain connectivity, defined as 

$$\text{rel. logECO} = \frac{1}{M \log N} \sum_{i=1}^{M} \log\left[\mathrm{ECO}(i)\right] \qquad (2)$$

For the 26 proteins without crosslinks, the relative logECO is identical with
the relative logCO $= \sum_{i=1}^{M} \log[\mathrm{CO}(i)] / (M \log N)$. The relative
logCO correlates with the folding rates of these 26 proteins with a Pearson
coefficient of 0.90, which is only slightly smaller than the correlation coefficient
0.92 for the relative CO (see footnote 2). In addition, the relative logECO captures
the folding rates of the 8 proteins with crosslinks. The standard deviation of the
filled circles from the regression line of the 26 proteins without crosslinks is 0.70
and, thus, comparable to the standard deviation 0.67 for these 26 proteins. The
relative logECO therefore provides a simple estimator for the folding rates of
two-state proteins both with and without crosslinks.
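To make the difference between the two measures concrete, here is a small Python sketch that evaluates Eq. (1) and Eq. (2) for a list of ECO values. The function names and the toy ECO values are hypothetical and serve only to show how the logarithm dampens the effect of a crosslink on a single long-range contact.

```python
import math


def relative_eco(eco_values, n_residues):
    """Eq. (1): average ECO of all contacts, divided by the chain length."""
    return sum(eco_values) / (len(eco_values) * n_residues)


def relative_log_eco(eco_values, n_residues):
    """Eq. (2): average log(ECO), divided by the log of the chain length.

    The base of the logarithm cancels in the ratio, so any base can be used.
    """
    log_sum = sum(math.log(eco) for eco in eco_values)
    return log_sum / (len(eco_values) * math.log(n_residues))


# Hypothetical chain of 64 residues with four contacts.
n = 64
without_crosslink = [1, 2, 4, 32]
with_crosslink = [1, 2, 4, 4]   # crosslink shortens the longest loop from 32 to 4

for label, ecos in (("without", without_crosslink), ("with", with_crosslink)):
    print(label, relative_eco(ecos, n), relative_log_eco(ecos, n))
```

In this toy case the crosslink reduces the relative ECO by a factor of about 3.5, but the relative logECO only by a factor of 1.6. This illustrates, in a purely schematic way, why the logarithmic measure responds less strongly to crosslinks.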



3 Discussion and conclusions 

The correlation between folding rates and simple topological measures of
two-state proteins has inspired various models of protein folding that are based on
native-state topology. 18-34 A deeper understanding of the remarkable success of
the topological measures in reproducing the folding rates of two-state proteins
requires a thorough test of the implications of these measures. 35 Two-state
proteins with crosslinks provide an excellent opportunity to test the 'localness
hypothesis' of some of the measures. The relative logECO passes this test, at
least for a currently available set of 8 two-state proteins with crosslinks, and
thus can be used to estimate the folding rates of two-state proteins both with
and without crosslinks.

2 The small deviation between the correlation coefficients 0.90 and 0.92 is within
reasonable error estimates of the coefficients. In a jack-knife approach, these errors
can be estimated by considering, for example, all subsets of the 26 data points
obtained by deleting up to two data points. For the relative ECO, the correlation
coefficients of these subsets vary from 0.88 to 0.94, with a standard deviation of
0.01. For the relative logECO, the correlation coefficients of the subsets range from
0.86 to 0.93, with the same standard deviation 0.01.
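The jack-knife estimate mentioned in footnote 2 can be sketched in a few lines of Python: compute the Pearson coefficient for the full data set and for every subset obtained by deleting one or two points, then look at the range and spread of the coefficients. The data below are synthetic placeholders, not the values of the 26 proteins, and the function names are assumptions for this example.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def delete_up_to_two(xs, ys):
    """Pearson r for the full set and all subsets with 1 or 2 points deleted."""
    indices = range(len(xs))
    coefficients = []
    for n_deleted in (0, 1, 2):
        for dropped in combinations(indices, n_deleted):
            kept = [k for k in indices if k not in dropped]
            coefficients.append(
                pearson_r([xs[k] for k in kept], [ys[k] for k in kept])
            )
    return coefficients


# Synthetic placeholder data: a topological measure and log10 of the rate.
measure = [5.0, 8.0, 11.0, 14.0, 17.0, 20.0]
log_kf = [5.9, 5.1, 3.8, 3.1, 1.6, 0.7]

rs = delete_up_to_two(measure, log_kf)
mean_r = sum(rs) / len(rs)
spread = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / len(rs))
print(len(rs), min(rs), max(rs), spread)
```

For 26 data points this procedure yields 1 + 26 + 325 = 352 subsets. The coefficients in the toy data are negative because the rate decreases as the measure increases, whereas the coefficients quoted in the text appear to be magnitudes.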

The topological measures that have been considered here are based on physical
loop-closure principles. The ECO of a contact is an estimate for the length
of the loop that has to be closed to form this contact in the unfolded protein
chain. For large loops, the logarithm of the loop length is proportional to the
loop-closure entropy for forming this contact in the unfolded state. 22,36-40 The
logarithm of the ECO in Eq. (2) thus can be interpreted as a loop-closure
entropy. The relative logECO is the average over the logarithm of the ECOs
for all native contacts, multiplied by a prefactor $1/\log N$, where N is the chain
length. To interpret this prefactor, it is important to note that the average over
the logarithm of the ECOs clearly overestimates the folding barrier. The reason
is that the loop-closure cost for contacts formed late in the folding process can
be reduced by contacts that have been formed earlier. 32-34 This overestimate
should increase with the chain length N. The prefactor $1/\log N$ in Eq. (2)
therefore may be seen as a heuristic, chain-length dependent correction of
this overestimate, and the relative logECO as a naive estimate of entropic
loop-closure barriers for folding.
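The link between loop length and entropy can be made explicit with a standard polymer-physics estimate. The sketch below assumes a power-law loop-closure probability with exponent c, where c = 3/2 holds for an ideal Gaussian chain and other chain models give different values. The symbols $\ell$, c and $\Delta S_{\text{loop}}$ are introduced here for illustration and do not appear in the text.

```latex
P_{\text{loop}}(\ell) \propto \ell^{-c}
\quad\Longrightarrow\quad
\Delta S_{\text{loop}}(\ell) = k_B \ln P_{\text{loop}}(\ell) = -c\,k_B \ln \ell + \text{const}

\frac{1}{M}\sum_{i=1}^{M} \Delta S_{\text{loop}}\bigl(\mathrm{ECO}(i)\bigr)
  = -c\,k_B\,\frac{1}{M}\sum_{i=1}^{M} \ln\left[\mathrm{ECO}(i)\right] + \text{const}
  = -c\,k_B \ln N \times (\text{rel. logECO}) + \text{const}
```

Under these assumptions, the average of the logarithms in Eq. (2) is proportional to the average loop-closure entropy loss per contact, and the relative logECO differs from it by the chain-length dependent factor $\ln N$.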

Topological measures without chain-length dependent prefactors exhibit weaker
correlations with the folding rates of two-state proteins. In the case of the
relative CO, the prefactor is $1/N$. The related topological measure without
this prefactor has been termed absolute CO. 3,41 For the 26 proteins without
crosslinks considered here, the correlation coefficient between absolute CO and
the folding rates is 0.69, significantly smaller than the correlation coefficient
0.92 for the relative CO. The correlation coefficient for the absolute logCO
$= \sum_{i=1}^{M} \log[\mathrm{CO}(i)] / M$ is 0.80. This correlation coefficient is
significantly smaller than the coefficient 0.90 for the relative logCO.

Clearly, simple topological measures have limitations in reproducing or
predicting folding rates. One of these limitations seems to be exemplified by the three
src SH3 domain mutants with crosslinks listed in Table 2. The mutant with a
crosslink between residues 35 and 50 has the largest folding rate among the
mutants. But the relative ECO and logECO of this mutant are only slightly
smaller than the corresponding values for the mutant with a crosslink between
residues 1 and 25, and larger than the values for the circularized mutant with a
crosslink between residues 1 and 56. The reason seems to be that the crosslink
between residues 35 and 50 stabilizes the distal hairpin of the src SH3 domain.
Mutational analysis of the wildtype src SH3 domain indicates that this
β-hairpin is a central structural element in the transition state for folding. 42
This seems to explain why crosslinking the hairpin has a particularly strong
impact on the folding rate. The effect of native-state topology and crosslinks
on the kinetics thus can also depend on structural details of transition states
or native states beyond the overall localness of contacts in these states.

Another limitation is that simple, topology-based measures cannot capture
sequence-dependent effects. Single-residue mutations and even relatively large
changes in the sequence typically have a 'less than tenfold effect' 43 on the
folding rate. These changes are comparable to the standard deviations of
the folding rates from the regression lines in the correlation analysis of the
topological measures. For the relative CO and logCO considered here, these
standard deviations are between 0.6 and 0.7 on the decadic logarithmic scale
(see above), which corresponds to an average error of a factor of about 4 to 5 in
rate predictions ($10^{0.6} \approx 4$, $10^{0.7} \approx 5$). For some proteins,
however, larger mutation-induced changes in the folding rate have been
observed. 44,45 On average, the effect of 'topological mutations' such as the
introduction or deletion of crosslinks on the folding kinetics is significantly
stronger than the effect of single-residue mutations. This is not astonishing,
since these mutations change the connectivity of the protein chain, not only the
local energetics. Other examples of 'topological mutants' are circular permutants
in which the wild-type termini of the protein chain are connected and new termini
are created by cleaving the chain somewhere else. 13,46-49 Circular permutation of
the protein S6 has a drastic effect on the transition state, 48 which has been
captured in a simple ECO-based model that predicts protein folding routes from
native structures. 33

Table 1: Two-state proteins without crosslinks

a Experimental values for the folding rates $k_f$ are from Table 1 of Grantcharova et
al. 3 b Residues 3 to 43. c Residues 1 to 104. d Residues 1 to 56. e Residues 20 to 83.
f Residues 4A to 85A. g Residues 1 to 62. h Residues 803 to 891. For NMR structures
with multiple models, the values for the rel. CO and rel. logCO are averages over
all models. Alternate locations for atoms in PDB files have been discarded to avoid
double or triple counting of corresponding contacts.

Table 2: Two-state proteins with crosslinks

folding rates in water have been extrapolated using the $m_f$-values given in the
references. b Residues 20 to 83. c Residues 135 to 190. d The N- and C-termini are
crosslinked via an inserted glycine residue. To calculate the rel. ECO and rel. logECO
for the circularized chain, we simply assume that this glycine residue makes 25
non-hydrogen atom contacts with the nearest neighbor residues 1 and 56, and 10
contacts with the next-nearest neighbor residues 2 and 55 (these are typical numbers
for glycine residues), but makes no contacts with other residues. We also assume that
the residues 1 and 56 have 10 contacts in the circularized chain.

Fig. 1. Relative CO of 26 two-state proteins without crosslinks (gray diamonds),
relative CO of 8 two-state proteins with crosslinks (open circles), and relative ECO
of these 8 proteins (filled circles) plotted against the decadic logarithm of their
folding rates $k_f$. The regression line for the 26 proteins without crosslinks is given
by $\log k_f = 8.18 - 0.386 \times (\text{rel. CO})$ and provides a topology-based
estimator for the folding rates of such proteins. The location of the majority of
filled circles clearly below the regression line indicates that the relative ECO, the
natural extension of the relative CO to proteins with crosslinks, tends to
overestimate the folding rates of these proteins. The proteins are listed in
Tables 1 and 2.

Fig. 2. Relative logCO of 26 two-state proteins without crosslinks (gray diamonds),
relative logCO of 8 two-state proteins with crosslinks (open circles), and relative
logECO of these 8 proteins (filled circles) plotted against the decadic logarithm of
their folding rates. The regression line for the 26 proteins without crosslinks is given
by $\log k_f = 12.77 - 0.315 \times (\text{rel. logCO})$. The standard deviation in
vertical direction from the regression line is 0.70 for the filled circles, which is only
slightly larger than the standard deviation 0.67 for the gray diamonds. This indicates
that the relative logECO provides a simple, topology-based estimator for the folding
rates of proteins both with and without crosslinks. In the absence of crosslinks, the
relative logECO is identical with the relative logCO.
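
The two regression lines quoted in the captions of Fig. 1 and Fig. 2 can be turned into simple estimator functions, sketched below in Python. The coefficients are taken from the captions. The function names are hypothetical, and the inputs must be given in the same units as the horizontal axes of the figures. The captions do not state these units, although the size of the slopes suggests values expressed in percent, which is an assumption here.

```python
# Regression coefficients as quoted in the captions of Fig. 1 and Fig. 2.
FIG1_INTERCEPT, FIG1_SLOPE = 8.18, -0.386    # line for the rel. CO
FIG2_INTERCEPT, FIG2_SLOPE = 12.77, -0.315   # line for the rel. logCO / logECO


def log10_kf_from_rel_co(rel_co):
    """Estimate log10 of the folding rate from the relative CO (Fig. 1 line)."""
    return FIG1_INTERCEPT + FIG1_SLOPE * rel_co


def log10_kf_from_rel_log_eco(rel_log_eco):
    """Estimate log10 of the folding rate from the relative logECO (Fig. 2 line)."""
    return FIG2_INTERCEPT + FIG2_SLOPE * rel_log_eco


# Hypothetical input values, chosen only to show the call pattern.
print(log10_kf_from_rel_co(10.0))
print(log10_kf_from_rel_log_eco(30.0))
```

According to the standard deviations reported for the two regression lines, such estimates carry a typical uncertainty of 0.6 to 0.7 on the decadic logarithmic scale.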